Stratified sampling of log records for approximate full-text search

ABSTRACT

A log record from a host machine node includes an invariant string and a term. A template identifier is selected, from among template identifiers within a template repository, for a template string matching the invariant string. A sampling count threshold is selected from among a set of sampling count thresholds based on the template identifier and the term. A template-term count is obtained based on a number of earlier log records that were received since the count was reset and have a template identifier and a term that match the template identifier and the term of the log record. Based on the template-term count satisfying the sampling count threshold, an index entry is generated in a sampled log records index based on the log record and the template-term count is reset to a defined value. Based on the template-term count not satisfying the sampling count threshold, the template-term count is incremented.

TECHNICAL FIELD

The present disclosure relates to computer systems and more particularly to analysis of a stream of log records from computer equipment.

BACKGROUND

Data centers can contain thousands of servers (both physical and virtual machines), with each server running one or more software applications. The servers and software applications generate streams of log records to indicate their operational states and progression. For example, software applications may output log records that sequentially list actions that have been performed and/or list application state information at various checkpoints or when triggered by the occurrence of defined events (e.g., faults).

These log records are stored and searched by systems operators for various purposes, e.g., to detect anomalies, troubleshoot problems, mine information, check the health of the servers, etc.

In existing systems, the log records are stored in a log record repository, which may be a full-text index (FTI). An FTI allows complex text queries to be performed on the log records. The storage requirements of an FTI are proportional to the amount of content in each of the log records. The log records can be generated on the order of millions per second for large data centers. At these rates, storing the log records efficiently (both in terms of space and time), while also allowing for efficient searches, can be a significant challenge.

SUMMARY

Some embodiments disclosed herein are directed to a method by a computer. The method includes receiving a log record as part of a stream of log records from a host machine node. The log record includes an invariant string and a term. The method further includes selecting a template identifier, from among a plurality of template identifiers within a template repository, for a template string matching the invariant string of the log record. A sampling count threshold is selected from among a set of sampling count thresholds based on the template identifier and the term of the log record. A template-term count is obtained based on a number of earlier log records that were received since the count was reset and have a template identifier and a term that match the template identifier and the term of the log record. Based on the template-term count satisfying the sampling count threshold, an index entry is generated in a sampled log records index based on the log record and the template-term count is reset to a defined value. Based on the template-term count not satisfying the sampling count threshold, the template-term count is incremented.

Some other embodiments disclosed herein are directed to a computer program product that includes a computer readable storage medium having computer readable program code embodied therewith. The computer readable program code includes computer readable program code to receive a log record as part of a stream of log records from a host machine node, the log record comprising an invariant string and a term, to select a template identifier, from among a plurality of template identifiers within a template repository, for a template string matching the invariant string of the log record, and to select a sampling count threshold from among a set of sampling count thresholds based on the template identifier and the term of the log record. Further computer readable program code obtains a template-term count based on a number of earlier log records, which have been received since the count was reset, that have a template identifier and a term that match the template identifier and the term of the log record. Further computer readable program code is to, based on the template-term count satisfying the sampling count threshold, generate an index entry in a sampled log records index based on the log record and reset the template-term count to a defined value. Further computer readable program code is to, based on the template-term count not satisfying the sampling count threshold, increment the template-term count.

BRIEF DESCRIPTION OF THE DRAWINGS

Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying drawings. In the drawings:

FIG. 1 is a block diagram of a system containing a log stream analysis computer which includes a stratified sampling computer that determines based on various template-term sampling count thresholds when log records of a stream are to be indexed in a sampled log records index, in accordance with some embodiments;

FIG. 2 is a flowchart of operations by the stratified sampling computer for determining based on various template-term sampling count thresholds when log records of a stream are to be indexed in a sampled log records index, in accordance with some embodiments;

FIG. 3 is a flowchart of operations by a search engine for searching a sampled log records index, based on a search term and time period defined by a received search query, to retrieve template strings and terms which are used to generate log records, and identifying which of those log records satisfy the search term, in accordance with some embodiments; and

FIG. 4 is a block diagram of a log stream analysis computer configured according to some embodiments.

DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present disclosure. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention. It is intended that all embodiments disclosed herein can be implemented separately or combined in any way and/or combination.

Some embodiments are disclosed herein in the context of the nonlimiting example block diagram of FIG. 1. A log stream analysis computer 100 receives log streams from one or more software sources executed by each of one or more host machine nodes 10. In the embodiment of FIG. 1, the log stream analysis computer 100 receives a log stream from Source ID_1 executed by the host machine node 10 identified by a Host ID, and can further receive log streams from other software sources executed by the same or other host machine nodes. A host machine node is also referred to as a “host node” and “host” for brevity.

A host machine node can include a physical host machine and/or a virtual machine (VM). The physical host machine includes circuitry that performs computer operations to execute one or more software sources. The physical host machine may include, without limitation, network content servers (e.g., Internet website servers, movie/television programming streaming servers, application program servers), network storage devices (e.g., cloud data storage servers), network data routers, network gateways, communication interfaces, program code processors, data memories, display devices, and/or peripheral devices. The physical host machine may include computer resources such as: processor(s), such as a central processing unit (CPU); network interface(s); memory device(s); mass data storage device(s), such as disk drives and solid state nonvolatile memory; etc.

A physical host machine can provide one or more VMs that execute one or more software sources. A virtual hypervisor can provide an interface between the VMs and a host operating system that allows multiple guest operating systems and associated software sources to run concurrently on a physical host machine. The host operating system is responsible for the management and coordination of activities and the sharing of the computer resources of the physical host machine.

Each software source belongs to a source type. For example, a “SQL Server” may be a source type and each installation of SQL Server is a software source belonging to the source type. Multiple sources of the same or different source types may be on the same host, and a software source may migrate between hosts. Each host and software source is identified by a unique identifier, Host ID and Source ID respectively. A log stream (generated by a software source of a particular host) can be uniquely identified by a compound identifier generated from a combination of the Host ID and Source ID, or in some embodiments may be uniquely identified by an identifier generated from the Source ID.

The log stream analysis computer 100 includes a log record processor 110 that partitions the received log stream into a sequence of log records according to a defined time interval (e.g., a defined number of seconds, minutes, or hours) or other defined event or rule. Each log record may be uniquely identified by an identifier (LogID) that is formed from a combination (e.g., concatenation) of the corresponding Log Stream ID from which the record was partitioned and a timestamp associated with the defined time interval or other defined event or rule. When a single log stream is received from a host machine node, the log record may be uniquely identified based on the timestamp alone. The interval size can be determined based on a trade-off analysis between storage space requirements and accuracy.

The log record processor 110 may store content of the log records in a log record repository 112. The log records may be indexed, such as by an inverted index, to facilitate searches among log records in the log record repository 112. As used herein, an index can provide a data structure which associates content of log records, such as strings of letters, words, and/or numbers, to their locations in the log records. The index allows running full-text queries on the log records. However, the storage needed for the index is a significant fraction of the input data size. The indexing time, storage and the search time are all proportional to the number of log records. In some distributed computing systems, such as large data centers, log records are generated on the order of millions per second. It can therefore be overly time-consuming and/or require excessive processing resources to conduct searches in the log record repository 112.

In accordance with at least some embodiments disclosed herein, a stratified sampling computer 120 operates to index log records only when various sampling count thresholds, which are defined based on content of the log record, are satisfied. Each log record includes an invariant string and a term. Different sampling count thresholds may be associated with different unique combinations of the invariant strings and the terms occurring in log records. For example, 100 percent sampling may be provided for log records containing one combination of invariant string and term which occurs very infrequently, and 0.001 percent sampling may be provided for log records containing another combination of invariant string and term which occurs much more frequently. The sampled log records can be indexed in a sampled log records index 130. Accordingly, log records containing frequently occurring combinations of invariant strings and terms can be rarely indexed in the sampled log records index 130, while log records containing another combination of invariant string and term can be indexed always, or for a much greater percentage of their occurrences, in the sampled log records index 130. At least one log record for each occurrence of a combination of invariant string and term may be indexed in the sampled log records index 130.

The rate at which log records are indexed in the sampled log records index 130 can be controlled inversely proportional to how often particular combinations of invariant strings and terms occur in those log records. Controlling the indexing rate based on how often a particular combination of invariant string and term occurs in log records enables differentiated rates of indexing between one category of log records containing a term that occurs infrequently in combination with one defined invariant string, and another category of log records containing the term that occurs very frequently in combination with another defined invariant string. Indexing a higher percentage of log records containing combinations that occur infrequently may enable higher fidelity operational analysis of operational states captured in those log records. Indexing a much lower percentage of log records containing combinations that occur frequently reduces the associated content volume of the sampled log records index 130, and enables the content to be more rapidly searched (e.g., real-time searches) and used to generate log records.

The rate at which log records are indexed can be controlled based on knowledge of how frequently various combinations of invariant strings and terms in a stream of log records are desired to be included in the sampled log records index 130 to enable, for example, analysis of those log records (e.g., root error cause analysis). Infrequent occurrences of a particular combination of invariant string and term can denote anomalies, errors, warnings, crashes, etc., with operation of the host machine node and/or a source software application, and therefore should be indexed more often when they do occur. The overhead for storing log records responsive to their containing an infrequently occurring combination of invariant string and term is accordingly reduced while the amount of information captured for subsequent high fidelity analysis of associated events is increased.

In one embodiment, the stratified sampling computer 120 receives a log record as part of a stream of log records from the host machine node 10. The log record includes an invariant string and a term. Some invariant strings contained in the log records correspond to template strings within a template repository 122. Unique ones of the invariant strings observed in the log records can be stored in the template repository 122 where they are logically associated with unique template identifiers.

Each template string can, for example, correspond to a “print” or other output routine in the software code of the source, which outputs an invariant string (e.g., which does not vary between “prints” by the same print routine) and a term (e.g., which can vary between “prints” by the same print routine) that is output in a same log record by the print statement whenever the print routine is executed. The invariant string may provide context for the term, such as a textual description that is intended for human understanding of the term, and which does not change between each repetition of the print routine or other output routine which generates the log record. Terms can include user names, IP addresses of the host machine nodes, event identifiers, values characterizing an application state (e.g., variable and/or register values) at an instant of time, processing error codes, etc., that can vary between each repetition of the print routine or other output routine which generates the log record.

Responsive to the invariant string in the received log record, the stratified sampling computer 120 selects a template identifier, from among the template identifiers within the template repository 122, for a template string matching the invariant string of the log record. The sampling computer 120 selects a sampling count threshold from among a set of sampling count thresholds based on the template identifier and the term of the log record. The sampling computer 120 obtains a template-term count based on a number of earlier log records, which have been received since the count was reset, that have a template identifier and a term that match the template identifier and the term of the log record. Based on the template-term count satisfying the sampling count threshold, the sampling computer 120 generates an index entry in the sampled log records index 130 based on the log record and resets the template-term count to a defined value. In contrast, based on the template-term count not satisfying the sampling count threshold, the sampling computer 120 increments the template-term count.

When generating the index entry in the sampled log records index 130, the sampling computer 120 may store the template identifier, the term, and an identifier for the log record in the sampled log records index 130. The sampled log records index 130 can be significantly compressed in memory size by storing the template identifier instead of the invariant string. The sampled log records index 130 can be used to perform keyword searching of log records.

A search engine 140 is provided that allows a user, via user equipment 150 (e.g., desktop computer, laptop computer, tablet computer, smart phone, etc.), to perform searches of content of the log records. In accordance with some embodiments, the searches are performed using the sampled log records index 130. The search engine 140 receives a search query defining a search string and a time period to be searched, and determines a range of log records to be searched based on the time period. The search engine 140 selects a set of index entries in the sampled log records index 130 based on the range of log records, and retrieves the set of index entries from the sampled log records index 130. For each of the index entries in the set, the search engine 140 identifies a template identifier and a term of the index entry, retrieves the template string corresponding to the template identifier from the template repository 122, and generates a log record based on the template string retrieved and the term of the index entry. The search engine 140 then searches for the search string defined by the search query among the log records generated from the index entries in the set, and returns log records, identified by the search as containing the search string, as a response to the search query.

FIG. 2 is a flowchart of operations by the stratified sampling computer 120 for determining based on various template-term sampling count thresholds when log records of a stream are to be indexed in the sampled log records index 130, in accordance with some embodiments.

Referring to FIG. 2, the stratified sampling computer 120 receives (block 200) a log record that contains an invariant string and a term. The log record may include a plurality of invariant strings and/or a plurality of terms. The log record may be identified by a log ID which is unique across all of the log records. The log ID can be generated based on a timestamp for when the log record was generated. The timestamp is translated into some time unit (based on the log record generation rate) since a fixed point in time (called epoch). For example, suppose the time unit is seconds, the epoch is (Jan. 1, 2001 00:00:00 AM) and the log record was generated on Jun. 1, 2015 at 10:45:15 AM. The logID will be the difference between the timestamp and the epoch, converted into seconds.
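The log ID arithmetic in the example above can be sketched as follows. This is a minimal illustration only; the function name `log_id` and the `unit_seconds` parameter are hypothetical and are not defined by the disclosure.

```python
from datetime import datetime

# Epoch from the example above: Jan. 1, 2001 at midnight.
EPOCH = datetime(2001, 1, 1, 0, 0, 0)


def log_id(timestamp, epoch=EPOCH, unit_seconds=1):
    """Return the number of whole time units elapsed between epoch and timestamp.

    unit_seconds is a hypothetical knob for the size of the time unit
    (1 means the log ID is expressed in seconds).
    """
    elapsed_seconds = int((timestamp - epoch).total_seconds())
    return elapsed_seconds // unit_seconds


# Log record generated on Jun. 1, 2015 at 10:45:15 AM.
example_id = log_id(datetime(2015, 6, 1, 10, 45, 15))
```

For the dates in the example, 5,264 whole days plus 10 hours, 45 minutes, and 15 seconds separate the timestamp from the epoch, so `example_id` is 454,848,315.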

The stratified sampling computer 120 identifies (block 202) a template identifier within the template repository 122 for a template string matching the invariant string of the log record.

In one embodiment, the template identifier is identified based on a predefined rule that identifies the structure of invariant string(s) and term(s) that are output by print routines or other output routines of a software source which is the source of the log stream. When this structure is not predefined by a rule, the template identifier can be inferred using one of the following embodiments.

In one embodiment, the template identifier is identified based on parsing content of the log record to generate strings, comparing the strings to template strings within the template repository 122, identifying one of the strings of the log record as the invariant string based on a match between the one of the strings and one of the template strings, and selecting the template identifier associated with the one of the template strings.

In another embodiment, the template identifier is identified based on parsing content of a plurality of log records that includes the log record to generate strings, comparing the strings to template strings within the template repository 122, identifying one of the strings of selected ones of the log records as the invariant string based on at least a threshold number of matches occurring between the one of the strings of the selected ones of the log records to a same one of the template strings within the template repository 122, and selecting the template identifier associated with the one of the template strings.

In another embodiment, the template identifier is identified based on parsing content of a sequence of the log records that includes the log record to generate strings, comparing the strings to template strings within the template repository 122 that are ordered in a defined sequence that is output by a defined software source on the host machine node 10, identifying one of the strings as the invariant string based on a match between the one of the strings and one of the template strings and further based on a previous match identified between one of the strings of a previous one of the log records in the sequence and a previous one of the template strings in the defined sequence, and selecting the template identifier associated with the one of the template strings.
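As one possible illustration of the first parsing embodiment above (matching a single log record against stored template strings), the sketch below swaps in a regular-expression match. It assumes, purely for illustration, that each template string marks the position of the term with a `<*>` placeholder; the placeholder convention and the names `compile_template` and `match_record` are hypothetical and are not part of the disclosure.

```python
import re


def compile_template(template_string, placeholder="<*>"):
    """Turn a template string into a regex that captures each term position."""
    literal_parts = [re.escape(part) for part in template_string.split(placeholder)]
    return re.compile("^" + "(.+?)".join(literal_parts) + "$")


def match_record(record, templates):
    """Return (template_id, terms) for the first matching template, else None.

    templates maps template identifiers to template strings.
    """
    for template_id, template_string in templates.items():
        match = compile_template(template_string).match(record)
        if match:
            return template_id, match.groups()
    return None


templates = {1: "User <*> logged in", 2: "Disk error code <*>"}
matched = match_record("User alice logged in", templates)
unmatched = match_record("Fan speed nominal", templates)
```

Here `matched` identifies template 1 with the term `alice`, while `unmatched` is `None`, which corresponds to the case where no template string in the repository matches the record.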

When the stratified sampling computer 120 identifies (block 204) that no template string in the template repository 122 matches the invariant string of the log record, a new template identifier for the invariant string of the log record is generated. The new template identifier and the invariant string of the log record are then stored (block 206) in the template repository 122 with a defined logical association between the new template identifier and the invariant string of the log record.
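The lookup-or-create behavior of blocks 202 through 206 can be sketched with a pair of dictionaries. This is a simplified in-memory stand-in for the template repository 122; the class name, method names, and the choice of sequential integer identifiers are assumptions made only for this example.

```python
class TemplateRepository:
    """Hypothetical in-memory stand-in for the template repository 122."""

    def __init__(self):
        self._id_by_string = {}   # invariant string -> template identifier
        self._string_by_id = {}   # template identifier -> invariant string

    def identify(self, invariant_string):
        """Return the template identifier, creating a new one when none matches."""
        template_id = self._id_by_string.get(invariant_string)   # block 202
        if template_id is None:                                   # block 204
            template_id = len(self._id_by_string) + 1             # assumed ID scheme
            self._id_by_string[invariant_string] = template_id    # block 206
            self._string_by_id[template_id] = invariant_string
        return template_id

    def template_string(self, template_id):
        """Return the template string associated with a template identifier."""
        return self._string_by_id[template_id]


repo = TemplateRepository()
first_id = repo.identify("User logged in:")
second_id = repo.identify("Disk error code:")
repeat_id = repo.identify("User logged in:")
```

Keeping both mappings lets the same structure serve the sampling path (string to identifier) and the search path (identifier back to string).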

The stratified sampling computer 120 selects (block 208) a sampling count threshold from among a set of sampling count thresholds based on the template identifier and the term of the log record. The sampling count threshold can be unique to a particular combination of template identifier and term or can be used for a plurality of defined combinations of template identifiers and terms.

The set of sampling count thresholds may reside in a repository 220. In one embodiment, a first sampling count threshold is defined for one or more combinations of template identifiers and terms that occur infrequently (e.g., occurring below a first threshold rate), a second sampling count threshold is defined for another one or more combinations of template identifiers and terms that occur more frequently (e.g., occur above the first threshold rate and below a second threshold rate), a third sampling count threshold is defined for still one or more other combinations of template identifiers and terms that occur even more frequently (e.g., occur above the second threshold rate), etc.

A sampling count threshold may be generated based on how frequently a particular combination of template identifier and term has occurred historically in a stream of log records. In one embodiment, for a combination of a template identifier and a term occurring in earlier log records received from the host machine node, the stratified sampling computer 120 counts a number of occurrences of the combination of the template identifier and the term in the earlier log records to generate a historical count, generates a new sampling count threshold for the combination of the template identifier and the term based on the historical count, and stores the new sampling count threshold in the set of sampling count thresholds (e.g., within the repository 220) with a logical association to the template identifier and the term.

In one approach when generating the new sampling count threshold for the combination of the template identifier and the term based on the historical count, the new sampling count threshold can be decreased based on less frequent occurrence of the combination of template identifier and term indicated by the historical count. In contrast, the new sampling count threshold can be increased based on more frequent occurrence of the combination of template identifier and term indicated by the historical count.
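One way to realize the approach above, in which the threshold decreases for rarer combinations and increases for more frequent ones, is to scale the threshold with the historical count. The scaling rule and the `target_samples` parameter below are illustrative assumptions; the disclosure does not prescribe a particular formula.

```python
def threshold_from_history(historical_count, target_samples=100):
    """Derive a sampling count threshold from a historical count.

    Assumed rule: aim to index roughly target_samples records out of
    historical_count occurrences, and never use a threshold below 1 so that
    rare combinations are indexed on every occurrence.
    """
    if historical_count < 0:
        raise ValueError("historical_count must be non-negative")
    return max(1, historical_count // target_samples)


rare_threshold = threshold_from_history(5)
frequent_threshold = threshold_from_history(1_000_000)
```

Under these assumptions a combination seen only 5 times gets a threshold of 1, while a combination seen one million times gets a threshold of 10,000.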

In another approach when generating the new sampling count threshold for the combination of the template identifier and the term based on the historical count, a first value is defined for the new sampling count threshold based on the historical count being less than a first threshold level defined based on a predicted frequency of problematic operation of the host machine node being reported in log reports. In contrast, a second value, which is greater than the first value, is defined for the new sampling count threshold based on the historical count being greater than a second threshold level that is greater than the first threshold level defined based on a predicted frequency of non-problematic operation of the host machine node being reported in log reports.

Sampling count thresholds may alternatively or additionally be generated dynamically based on observations made on the content of a stream of received log records. For example, for each of a plurality of combinations of template identifiers and terms occurring in log records received as part of the log stream, the stratified sampling computer 120 can count a number of occurrences of the combination of template identifier and term in the log records to generate a historical count, and generate a new sampling count threshold for the combination of template identifier and term based on the historical count. The new sampling count threshold is stored in the set of sampling count thresholds (e.g., the repository 220) with a logical association to the combination of template identifier and term. Thus, as new template strings and/or terms are observed as content within the stream of log records, the stratified sampling computer 120 can operate to generate new sampling count thresholds for each of those combinations. When generating new sampling count thresholds, the computer 120 may generate a unique sampling count threshold for each combination or may associate a plurality of observed combinations having similar frequency of occurrences to a same group having a same sampling count threshold. Thus, some combinations of template identifiers and terms that occur within a first range of rates can be associated with a first sampling count threshold, while some other combinations of template identifiers and terms that occur within a second range of rates can be associated with a second sampling count threshold, and so on.
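The grouping of combinations with similar frequencies under a shared threshold can be sketched as below. Grouping by order of magnitude of the observed count, and the `samples_per_group` parameter, are assumptions chosen for illustration; any other partition into ranges of rates would fit the description equally well.

```python
from collections import Counter


def thresholds_by_group(observed_combinations, samples_per_group=10):
    """Assign one shared threshold to all combinations in the same count range.

    observed_combinations is an iterable of (template_id, term) pairs, one per
    received log record. Combinations whose counts have the same number of
    decimal digits fall into the same group (an assumed grouping rule).
    """
    counts = Counter(observed_combinations)   # historical count per combination
    thresholds = {}
    for combination, count in counts.items():
        magnitude = len(str(count)) - 1       # floor(log10(count)) for count >= 1
        group_floor = 10 ** magnitude         # lower bound of the count range
        thresholds[combination] = max(1, group_floor // samples_per_group)
    return thresholds


observed = (
    [(1, "alice")] * 3
    + [(2, "ok")] * 250
    + [(2, "retry")] * 999
    + [(3, "heartbeat")] * 1500
)
group_thresholds = thresholds_by_group(observed)
```

In this example the two combinations observed 250 and 999 times land in the same range and share a threshold of 10, the combination observed 3 times keeps a threshold of 1, and the combination observed 1,500 times gets a threshold of 100.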

A sampling count threshold may alternatively or additionally be generated based on a percentage value that is received via a user interface from a user. The percentage value identifies a percentage of occurrences of the combination of template identifier and term in a log record that are to be indexed in the sampled log records index 130.
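Converting a user-supplied percentage into a sampling count threshold amounts to taking the reciprocal of the sampling fraction. The helper below is a minimal sketch; the function name and the rounding rule are assumptions. Exact rational arithmetic is used so that small percentages are not distorted by floating-point error.

```python
from fractions import Fraction


def threshold_from_percentage(percentage):
    """Return N such that indexing every Nth occurrence yields about percentage %.

    percentage must be greater than 0 and at most 100.
    """
    value = Fraction(str(percentage))
    if not 0 < value <= 100:
        raise ValueError("percentage must be in the range (0, 100]")
    return max(1, round(Fraction(100) / value))


every_record = threshold_from_percentage(100)
one_in_a_million = threshold_from_percentage(0.0001)
```

With these definitions, 100 percent maps to a threshold of 1 (every occurrence is indexed) and 0.0001 percent maps to a threshold of 1,000,000.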

With continued reference to FIG. 2, the stratified sampling computer 120 determines (block 210) whether it is observing a first occurrence of a combination of template identifier and term and, if so, the log record is indexed (block 216) in the sampled log records index 130. The indexing of log records in the sampled log records index 130 may form an inverted index structure that logically associates the template identifier, the term, and an identifier for the log record. Accordingly, at least one log record containing each unique combination of template identifier and term is indexed in the sampled log records index 130. The sampled log records index 130 may require many orders of magnitude less storage space than would be required by a full index of all log records in the log record repository 112. A template-term counter for the combination of template identifier and term is incremented (block 218).

When the combination of template identifier and term is determined (block 210) to have occurred before, the stratified sampling computer 120 further determines (block 212) whether a template-term counter for the combination of template identifier and term satisfies the sampling count threshold that was selected (block 208). The template-term counter can operate to uniquely track the number of occurrences of the particular combination of template identifier and term, which are presently observed in the log record, that have occurred in earlier log records which have been received since the counter was last reset. Accordingly, the stratified sampling computer 120 may maintain a table of template-term counter values each associated with a different combination of template identifier and term. A template-term count can be selected from among the plurality of template-term counts, which are each associated with a different combination of template identifier and term, based on the template identifier and the term from the log record.

When the template-term counter satisfies the sampling count threshold, the template-term counter is reset (block 214) to a defined value (e.g., reset to 0), and the log record is indexed (block 216) in the sampled log records index 130. Accordingly, samples of a stream of log records are selected for indexing to the sampled log records index 130 each time another threshold number of a defined combination of template identifier and term are observed in log records.

For example, when the sampling count threshold is defined to cause 0.0001% of a defined combination of template identifier and term to be indexed in the sampled log records index 130, each one millionth occurrence of a log record containing that defined combination of template identifier and term is indexed in the sampled log records index 130. The sampled log record is not selected merely because it is the one millionth occurrence of a log record in the stream, but instead because, within the perhaps hundreds of millions of log records that have been received in the stream, it is the one millionth occurrence of a log record in the stream containing the defined combination of template identifier and term. Further to this example, for another combination of template identifier and term that occurs much less frequently and which may be known to be associated with erroneous operation of the host machine node 10, each tenth occurrence of a log record containing that other combination of template identifier and term is indexed in the sampled log records index 130.

The template-term counter for the combination of template identifier and term is then incremented (block 218).
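The decision flow of blocks 208 through 218 can be summarized in a short sketch. It assumes that a counter "satisfies" the threshold when it is greater than or equal to it, that the defined reset value is 0, and that the sampled log records index 130 can be modeled as a plain list of (template identifier, term, log ID) entries; the class and attribute names are hypothetical.

```python
class StratifiedSampler:
    """Hypothetical sketch of the FIG. 2 sampling flow."""

    def __init__(self, thresholds, default_threshold=1):
        self.thresholds = thresholds            # (template_id, term) -> threshold
        self.default_threshold = default_threshold
        self.counters = {}                      # table of template-term counters
        self.index = []                         # simplified sampled log records index

    def process(self, template_id, term, log_id):
        """Handle one log record; return True when the record was indexed."""
        key = (template_id, term)
        threshold = self.thresholds.get(key, self.default_threshold)  # block 208
        if key not in self.counters:            # block 210: first occurrence
            self.counters[key] = 0
            sampled = True
        elif self.counters[key] >= threshold:   # block 212: assumed ">=" test
            self.counters[key] = 0              # block 214: reset to defined value
            sampled = True
        else:
            sampled = False
        if sampled:
            self.index.append((template_id, term, log_id))  # block 216
        self.counters[key] += 1                 # block 218
        return sampled


sampler = StratifiedSampler({(1, "ok"): 3})
for record_id in range(7):
    sampler.process(1, "ok", record_id)
sampled_ids = [entry[2] for entry in sampler.index]
```

With a threshold of 3, the first occurrence is indexed and then every third occurrence after it, so the sampled log IDs are 0, 3, and 6. A combination that falls back to the default threshold of 1 is indexed on every occurrence.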

Searching Log Records Using the Sampled Log Records Index

FIG. 3 is a flowchart of operations that may be performed by the search engine 140 for searching the sampled log records index 130, based on a search term and time period defined by a received search query, to retrieve template strings and terms which are used to generate log records, and to identify which of those log records satisfy the search term, in accordance with some embodiments. Referring to FIGS. 1 and 3, a search query is received (block 300) that defines a search string and a time period to be searched. A range of log records to be searched is determined (block 302) based on the time period. A set of index entries in the sampled log records index 130 is selected (block 304) based on the range of log records. The set of index entries is retrieved (block 306) from the sampled log records index 130. For each of the index entries in the set, operations (block 308) repeat to identify (block 310) a template identifier and a term of the index entry, retrieve (block 312) the template string corresponding to the template identifier from the template repository 122, and generate (block 314) a log record based on the template string retrieved and the term of the index entry. Searches (block 316) are performed among the log records generated from the index entries in the set to identify the search string defined by the search query. Log records which were identified by the search as containing the search string are returned (block 318) as a response to the search query.
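The search operations of FIG. 3 can be sketched as a single pass over the index entries. The sketch assumes that index entries are (template identifier, term, log ID) tuples, that the time period has already been translated into a range of log IDs, and that each template string marks the term position with a `<*>` placeholder; these representations and the function name are illustrative assumptions only.

```python
def search_sampled_index(index_entries, templates, search_string,
                         first_log_id, last_log_id, placeholder="<*>"):
    """Regenerate log records from index entries and return those that match.

    index_entries: iterable of (template_id, term, log_id) tuples.
    templates: mapping of template_id -> template string.
    """
    results = []
    for template_id, term, log_id in index_entries:          # block 308
        if not first_log_id <= log_id <= last_log_id:         # blocks 302-306
            continue
        template_string = templates[template_id]              # blocks 310-312
        record = template_string.replace(placeholder, term, 1)  # block 314
        if search_string in record:                            # block 316
            results.append((log_id, record))
    return results                                             # block 318


templates = {1: "User <*> logged in", 2: "Disk error code <*>"}
entries = [(1, "alice", 10), (1, "bob", 20), (2, "E42", 30)]
login_hits = search_sampled_index(entries, templates, "logged in", 0, 25)
error_hits = search_sampled_index(entries, templates, "E42", 0, 100)
```

Because the index stores only the template identifier and the term, the full text of each record is rebuilt at query time before the search string is applied.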

When returning (block 318) the log records, each invariant string and term may be tagged with an indication of the frequency with which it occurred. A user can determine characteristics of the returned results by observing the frequency of the invariant strings and/or terms. A user may use the returned results to generate another search query having different search terms and/or logical combinations thereof based on what the user learned from the returned results. The new search query may be run against all log records stored in the log record repository 112 to obtain a complete listing of log records that were received in the stream which satisfy conditions of the new search query. It is noted that searches performed using the sampled log records index 130 return fewer log records than a search on the log record repository 112 or a full index of all log records in the log record repository 112, but may be performed using far less computational and storage resources and/or much faster.

Example Log Stream Analysis Computer

FIG. 4 is a block diagram of the log stream analysis computer 100 or a component thereof in FIG. 1 configured according to one embodiment. Referring to FIG. 4, a processor 400 may include one or more data processing circuits, such as a general purpose and/or special purpose processor (e.g., microprocessor and/or digital signal processor) that may be collocated or distributed across one or more networks. The processor 400 is configured to execute computer readable program code in a memory 410, described below as a computer readable medium, to perform some or all of the operations and methods disclosed herein for one or more of the embodiments. The program code can include stratified sampling code 412, log record processing code 414, search code 416, the template repository 122, the sampled log records index 130, and/or the log record repository 112. The stratified sampling code 412 can be computer readable program code to perform at least some of the operations disclosed herein for the stratified sampling computer 120 and associated operations of FIG. 2. The log record processing code 414 can be computer readable program code to perform at least some of the operations disclosed herein regarding processing of a stream of log records by the log record processor 110. The search code 416 can be computer readable program code to perform at least some of the operations disclosed herein regarding searching for log records, such as the operations of FIG. 3. Although a single memory block 410 has been illustrated for simplicity, it is to be understood that any number, combination of types, and hierarchy of memory storage devices (e.g., solid state memory devices, local disk drives, networked disk drives, etc.) can be used. A network interface 404 can communicatively connect the processor 400 to the host machine nodes 10, the search engine 140, and the user equipment 150 shown in FIG. 1.

FURTHER DEFINITIONS AND EMBODIMENTS

In the above description of various embodiments of the present disclosure, aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented in entirely hardware, entirely software (including firmware, resident software, micro-code, etc.) or a combined software and hardware implementation, any of which may generally be referred to herein as a “circuit,” “module,” “component,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product comprising one or more computer readable media having computer readable program code embodied thereon.

Any combination of one or more computer readable media may be used. The computer readable media may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an appropriate optical fiber with a repeater, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the “C” programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that when executed can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions when stored in the computer readable medium produce an article of manufacture including instructions which when executed, cause a computer to implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable instruction execution apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatuses or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various aspects of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Like reference numbers signify like elements throughout the description of the figures.

The corresponding structures, materials, acts, and equivalents of any means or step plus function elements in the claims below are intended to include any disclosed structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.

The invention claimed is:
 1. A method performed by a stratified sampling computer configured to index log records of a computing network in a datacenter, the method comprising: receiving a log record as part of a stream of log records from a host machine node of the computing network, the log record comprising an invariant string identifying a problematic operation of the host machine node and a variant term; selecting a template identifier, from among a plurality of template identifiers within a template repository, for a template string matching the invariant string of the log record; selecting a sampling count threshold from among a set of sampling count thresholds based on a combination of the template identifier and the term of the log record, the sampling count threshold defining a number of occurrences of the combination of the template identifier and the term of the log record required to generate an index entry in a sampled log records index for the log record; determining whether a number of occurrences of the combination of the template identifier and the variant term satisfy the selected sampling count threshold; responsive to determining the number of occurrences of the combination of the template identifier and the variant term satisfy the selected sampling count threshold, generating an index entry in the sampled log records index based on the log record, the index entry excluding the invariant string and comprising the template identifier, the variant term, and an identifier for the log record; responsive to determining the number of occurrences of the combination of the template identifier and the variant term do not satisfy the selected sampling count threshold, preventing generation of the index entry in the sample log records index; responsive to receiving a search query containing a search string from a user equipment operating in the computer network, retrieving a set of index entries from the sampled log records index; for each of the index entries in the set, identifying a template identifier and a variant term of the index entry, retrieving the template string corresponding to the template identifier from the template repository, and generating a log record based on the template string retrieved and the variant term of the index entry; searching for the search string comprised in the search query from among the log records generated from the index entries in the set; and returning, to the user equipment, the log records generated from the index entries in the set, identified by the search as containing the search string, as a response to the search query.
 2. The method of claim 1, further comprising: obtaining a template-term count based on a number of earlier log records that were received since the count was reset and have a template identifier and a variant term that match the template identifier and the term of the log record; based on the template-term count not satisfying the sampling count threshold, incrementing the template-term count; and wherein generating the index entry in a sampled log records index based on the log record further comprises generating the index entry in the sampled log records index based on the template-term count satisfying the sampling count threshold.
 3. The method of claim 2, wherein the obtaining a template-term count based on a number of earlier log records that were received since the count was reset and have a template identifier and a term that match the template identifier and the term of the log record, comprises: selecting the template-term count from among a plurality of template-term counts each associated with a different combination of template identifier and term, based on the template identifier and the variant term of the log record.
 4. The method of claim 2, further comprising: for a combination of a template identifier and a variant term occurring in earlier log records received from the host machine node, counting a number of occurrences of the combination of the template identifier and the variant term in the earlier log records to generate a historical count; generating a new sampling count threshold for the combination of the template identifier and the variant term based on the historical count; and storing the new sampling count threshold in the set of sampling count thresholds with a logical association to the template identifier and the variant term.
 5. The method of claim 4, wherein the generating a new sampling count threshold for the combination of the template identifier and the variant term based on the historical count, comprises: decreasing the new sampling count threshold based on less frequent occurrence of the combination of the template identifier and the variant term indicated by the historical count; and increasing the new sampling count threshold based on more frequent occurrence of the combination of the template identifier and the variant term indicated by the historical count.
 6. The method of claim 4, wherein the generating a new sampling count threshold for the combination of the template identifier and the variant term based on the historical count, comprises: defining a first value for the new sampling count threshold based on the historical count being less than a first threshold level defined based on a predicted frequency of problematic operation of the host machine node being reported in log reports; and defining a second value, which is greater than the first value, for the new sampling count threshold based on the historical count being greater than a second threshold level that is greater than the first threshold level defined based on a predicted frequency of non-problematic operation of the host machine node being reported in log reports.
 7. The method of claim 2, further comprising: for each of a plurality of combinations of template identifiers and variant terms occurring in log records received as part of the log stream, counting a number of occurrences of the combination of template identifier and variant term in the log records to generate a historical count, generating a new sampling count threshold for the combination of template identifier and variant term based on the historical count, and storing the new sampling count threshold in the set of sampling count thresholds with a logical association to the combination of template identifier and variant term.
 8. The method of claim 2, further comprising: generating a new sampling count threshold based on a percentage value received via a user interface from a user for a percentage of occurrences of a combination of template identifier and variant term in a log record that are to be indexed in the sampled log records index; and storing the new sampling count threshold in the set of sampling count thresholds with a logical association to the combination of template identifier and variant term.
 9. The method of claim 1, wherein receiving the search query defining a search string comprises receiving a time period to be searched; the method further comprising: determining a range of log records to be searched based on the time period; selecting the set of index entries in the sampled log records index based on the range of log records.
 10. The method of claim 1, wherein the selecting the template identifier, from among the plurality of template identifiers within the template repository, for the template string matching the invariant string of the log record, comprises: parsing content of the log record to generate strings; comparing the strings to template strings within the template repository; identifying one of the strings of the log record as the invariant string based on a match between the one of the strings and one of the template strings; and selecting the template identifier associated with the one of the template strings.
 11. The method of claim 1, wherein the selecting the template identifier, from among the plurality of template identifiers within the template repository, for the template string matching the invariant string of the log record, comprises: parsing content of a plurality of log records that includes the log record to generate strings; comparing the strings to template strings within the template repository; identifying one of the strings of selected ones of the log records as the invariant string based on at least a threshold number of matches occurring between the one of the strings of the selected ones of the log records to a same one of the template strings within the template repository; and selecting the template identifier associated with the one of the template strings.
 12. The method of claim 1, wherein the selecting the template identifier, from among the plurality of template identifiers within the template repository, for the template string matching the invariant string of the log record, comprises: parsing content of a sequence of the log records that includes the log record to generate strings; comparing the strings to template strings within the template repository that are ordered in a defined sequence that is output by a defined software source on the host machine node; identifying one of the strings as the invariant string based on a match between the one of the strings and one of the template strings and further based on a previous match identified between one of the strings of a previous one of the log records in the sequence and a previous one of the template strings in the defined sequence; and selecting the template identifier associated with the one of the template strings.
 13. A computer program product comprising: a computer readable non-transitory storage medium of a stratified sampling computer having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code to receive a log record as part of a stream of log records from a host machine node of the computing network, the log record comprising an invariant string identifying a problematic operation of the host machine node and a variant term; computer readable program code to select a template identifier, from among a plurality of template identifiers within a template repository, for a template string matching the invariant string of the log record; computer readable program code to select a sampling count threshold from among a set of sampling count thresholds based on a combination of the template identifier and the term of the log record, the sampling count threshold defining a number of occurrences of the combination of the template identifier and the term of the log record required to generate an index entry in a sampled log records index for the log record; computer readable program code to determine whether a number of occurrences of the combination of the template identifier and the variant term satisfy the selected sampling count threshold; computer readable program code to generate an index entry in the sampled log records index based on the log record, the index entry excluding the invariant string and comprising the template identifier, the variant term, and an identifier for the log record in response to determining the number of occurrences of the combination of the template identifier and the variant term satisfy the selected sampling count threshold; computer readable program code to prevent generation of the index entry in the sample log records index in response to determining the number of occurrences of the combination of the template identifier and the variant term do not satisfy the selected sampling count threshold; computer readable program code to retrieve a set of index entries from the sampled log records index in response to receiving a search query containing a search string from a user equipment operating in the computer network; computer readable program code to, for each of the index entries in the set: identify a template identifier and a variant term of the index entry, retrieve the template string corresponding to the template identifier from the template repository, and generate a log record based on the template string retrieved and the term of the index entry; computer readable program code to search for the search string comprised in the search query from among the log records generated from the index entries in the set; and computer readable program code to return log records, identified by the search as containing the search string, as a response to the search query.
 14. The computer program product of claim 13, further comprising: computer readable program code to obtain a template-term count based on a number of earlier log records that were received since the count was reset and have a template identifier and a term that match the template identifier and the term of the log record; computer readable program code to, based on the template-term count not satisfying the sampling count threshold, increment the template-term count; and wherein the computer readable program code to generate the index entry in the sampled log records index based on the log record further comprises computer readable program code to generate the index entry and reset the template term count to a defined value based on the template-term count satisfying the sampling count threshold.
 15. The computer program product of claim 14, wherein the computer readable program code to retrieve a set of index entries from the sampled log records index further comprises: computer readable program code to receive the search query defining the search string and a time period to be searched; computer readable program code to determine a range of log records to be searched based on the time period; computer readable program code to select the set of index entries in the sampled log records index based on the range of log records.
 16. The computer program product of claim 14, wherein the computer readable program code to obtain a template-term count based on a number of earlier log records that were received since the count was reset and have a template identifier and a variant term that match the template identifier and the term of the log record, comprises: computer readable program code to select the template-term count from among a plurality of template-term counts each associated with a different combination of template identifier and variant term, based on the template identifier and the variant term of the log record.
 17. The computer program product of claim 14, further comprising: computer readable program code to, for a combination of a template identifier and a variant term occurring in earlier log records received from the host machine node, count a number of occurrences of the combination of the template identifier and the variant term in the earlier log records to generate a historical count; computer readable program code to generate a new sampling count threshold for the combination of the template identifier and the variant term based on the historical count; and computer readable program code to store the new sampling count threshold in the set of sampling count thresholds with a logical association to the template identifier and the variant term.
 18. The computer program product of claim 17, wherein the computer readable program code to generate a new sampling count threshold for the combination of the template identifier and the variant term based on the historical count, comprises: computer readable program code to define a first value for the new sampling count threshold based on the historical count being less than a first threshold level defined based on a predicted frequency of problematic operation of the host machine node being reported in log reports; and computer readable program code to define a second value, which is greater than the first value, for the new sampling count threshold based on the historical count being greater than a second threshold level that is greater than the first threshold level defined based on a predicted frequency of non-problematic operation of the host machine node being reported in log reports.
 19. The computer program product of claim 14, further comprising: computer readable program code to, for each of a plurality of combinations of template identifiers and variant terms occurring in log records received as part of the log stream, count a number of occurrences of the combination of template identifier and variant term in the log records to generate a historical count, generate a new sampling count threshold for the combination of template identifier and variant term based on the historical count, and store the new sampling count threshold in the set of sampling count thresholds with a logical association to the combination of template identifier and variant term.
 20. The computer program product of claim 13, wherein the computer readable program code to select the template identifier, from among the plurality of template identifiers within the template repository, for the template string matching the invariant string of the log record, comprises: computer readable program code to parse content of a plurality of log records that includes the log record to generate strings; computer readable program code to compare the strings to template strings within the template repository; computer readable program code to identify one of the strings of selected ones of the log records as the invariant string based on at least a threshold number of matches occurring between the one of the strings of the selected ones of the log records to a same one of the template strings within the template repository; and computer readable program code to select the template identifier associated with the one of the template strings.